html output file, and also the Rmd file.Melbourne property prices have taken their biggest hit since 2012, falling by almost 2 per cent in the past three months Jim Malo, Jul 26 2018, Domain
This assignment explores the data provided on Melbourne house prices by Anthony Pino. The goal is to examine whether housing prices have cooled in Melbourne, and help Anthony decide whether it is time to buy a two bedroom apartment in Northcote.
The code pulls a google map centered on Melbourne, and overlays the points corresponding to the lat/long of the properties. There is a larger than Melbourne collection region, and the zoom accommodates the greater Melbourne region.
The code filters on the suburb and units. The trick in the imputation is to use the round function to generate integer imputed values.
nc <- mh %>% filter(Suburb == "Northcote", Type == "u")
ggplot(nc, aes(x=Date, y=Price)) +
geom_point() + geom_smooth(se=F) +
facet_wrap(~Bedroom2, ncol=4)
library(naniar)
nc_ms <- nc %>% bind_shadow()
br2 <- lm(Bedroom2~Rooms, data=nc_ms)
nc_ms <- nc_ms %>%
mutate(Bedroom2=ifelse(is.na(Bedroom2),
round(predict(br2, new=nc_ms), 0), Bedroom2))
ggplot(nc_ms, aes(x=Date, y=Price)) +
geom_point() + geom_smooth(se=F) +
facet_wrap(~Bedroom2, ncol=4)
We learn that there are not many 3 bedroom units in Northcote. For 1 and 2 bedroom units price has been pretty steady, with 2 bedroom units perhaps gradually increasing in price. You would need to be prepared to pay about $600,000 to buy a two bedroom unit. A price of around $500k could be a bargain. One bedroom units are worth around $350k.
The code imputes the missing bedroom values using a linear model on rooms. It then focuses only on two bedroom units, for suburbs with at least 30 records. (30 might be too high, so a lower choice could be reasonable too.)
mh_ms <- mh %>% bind_shadow()
br2 <- lm(Bedroom2~Rooms, data=mh_ms)
mh_ms <- mh_ms %>%
mutate(Bedroom2=ifelse(is.na(Bedroom2),
round(predict(br2, new=mh_ms), 0), Bedroom2))
mh_ms %>% filter(Type == "u", Bedroom2 == 2) %>% count(Suburb, sort = TRUE) %>% ggplot(aes(x=n)) + geom_histogram()
keep <- mh_ms %>% filter(Type == "u", Bedroom2 == 2) %>%
count(Suburb, sort = TRUE) %>%
filter(n > 30)
mh_u <- mh_ms %>% filter(Suburb %in% keep$Suburb) %>%
mutate(days = as.numeric(Date - ymd("2016-01-28")))
library(purrr)
by_suburb <- mh_u %>%
select(Suburb, Date, Price, days) %>%
group_by(Suburb) %>%
nest()
by_suburb <- by_suburb %>%
mutate(
model = purrr::map(data, ~ lm(Price ~ days,
data = .))
)
suburb_coefs <- by_suburb %>%
unnest(model %>% purrr::map(broom::tidy))
suburb_coefs <- suburb_coefs %>%
select(Suburb, term, estimate) %>%
spread(term, estimate) %>%
rename(intercept = `(Intercept)`)
#head(suburb_coefs)
p <- ggplot(suburb_coefs, aes(x=intercept, y=days,
label=Suburb)) +
geom_point(alpha=0.5, size=2)
library(plotly)
ggplotly(p)
Across all of these suburbs the price of two bedroom units has been increasing. The price increases are pretty staggering. For example, in Malvern, starting from an average price of about $1.6mil in January 2016, prices have been increasing by $1200 PER DAY!!! Windsor started off with an average price around $670k and has increased by about $800 per day! On the other hand, there are three suburbs that have seen a decrease in price. Caulield North started at an average price about $1mil and it has been dropping by about $200 per day.
suburb_fit <- by_suburb %>%
unnest(model %>%
purrr::map(broom::glance))
p1 <- ggplot(suburb_fit, aes(x=r.squared)) + geom_histogram()
bestfit <- suburb_fit %>% filter(r.squared > 0.08)
mh_u_sub <- mh_u %>% filter(Suburb %in% bestfit$Suburb)
p2 <- ggplot(data=mh_u_sub, aes(x=Date, y=Price)) +
geom_point() + geom_smooth(method="lm", se=FALSE) +
ggtitle(mh_u_sub$Suburb)
library(gridExtra)
grid.arrange(p1, p2, ncol=2)
All of the models are very weak! R2 ranges from 0.0-0.1, so at most 10% of the variation is explained. The best model corresponds to Windsor. Prices of 2 bedroom units in Windsor range from about $250k-$2.5mil! Prices have gone from $750k to $1.25mil on average in this two year period. However, there are still apartments sold for $500k this year.
The code computes the proportion of properties in the PI or VB categories for each month. This is the proportion of properties that did not sell.
#mh_u %>% count(Method, sort=TRUE)
mh_u_mth <- mh_u %>%
mutate(year = year(Date), month = month(Date)) %>%
group_by(Suburb, year, month) %>%
count(Method) %>%
mutate(p = n/sum(n)) %>%
filter(Method %in% c("PI", "VB")) %>%
mutate(time = (year-2016)*12+month)
p <- ggplot(mh_u_mth, aes(x=time, y=p, label=Suburb)) +
geom_smooth() +
geom_point() +
facet_wrap(~Method)
ggplotly(p)
The properties were aggregated to month, to group enough properties together to make examining proportions reasonable. The proportion of properties that were passed in at auction has been fairly stable over this time period. The properties with vendor bids appears to have been increasing a little over the recent months. There are some suburbs with relly high rates of no sales.
This code imputes the missings for bathroom, filters to the suburbs of interest, subsets to a set of bedrooms and bathrooms where there is enough data, and then fits some models.
ba2 <- lm(Bathroom~Rooms, data=mh_ms)
mh_ms <- mh_ms %>%
mutate(Bathroom=ifelse(is.na(Bathroom),
round(predict(ba2, new=mh_ms), 0), Bathroom))
monash <- mh_ms %>% filter(Suburb %in% c("Notting Hill", "Glen Waverley",
"Clayton", "Clayton South","Oakleigh East", "Huntingdale",
"Mount Waverley"),
Type == "h") %>%
select(Suburb, Price, Rooms, Date, Bedroom2, Bathroom, Car, Landsize) %>%
mutate(day=as.numeric(Date)) %>%
mutate(day=day-min(day))
monash <- monash %>% filter(Bedroom2 > 2, Bedroom2<5, Bathroom<3)
ggplot(monash, aes(x=Date, y=Price)) +
geom_point() +
geom_smooth(method="lm", se=FALSE) +
facet_grid(Bathroom~Bedroom2)
library(broom)
monash_fit <- lm(Price~day+Rooms+Bedroom2+Bathroom+Car+Landsize, data=monash)
tidy(monash_fit)
term estimate std.error statistic p.value
1 (Intercept) 375121.2044 189450.4873 1.9800488 4.941296e-02
2 day -376.2263 256.8410 -1.4648222 1.449316e-01
3 Rooms 53345.5305 179882.6259 0.2965574 7.671890e-01
4 Bedroom2 -47745.2684 183785.6009 -0.2597879 7.953614e-01
5 Bathroom 104562.8733 55160.6243 1.8956071 5.981511e-02
6 Car -94667.4427 28842.4340 -3.2822279 1.264606e-03
7 Landsize 1421.7348 146.5577 9.7008523 8.685319e-18
glance(monash_fit)
r.squared adj.r.squared sigma statistic p.value df logLik
1 0.4122985 0.3902597 294654.6 18.70784 1.993047e-16 7 -2336.512
AIC BIC deviance df.residual
1 4689.023 4713.967 1.389141e+13 160
monash_fit2 <- lm(Price~day+Bathroom+Car+Landsize, data=monash)
tidy(monash_fit2)
term estimate std.error statistic p.value
1 (Intercept) 397445.3720 134337.3833 2.958561 3.554109e-03
2 day -373.8429 253.7763 -1.473120 1.426591e-01
3 Bathroom 105225.3067 47108.2139 2.233693 2.687241e-02
4 Car -94277.2172 28572.1769 -3.299616 1.190568e-03
5 Landsize 1412.7699 139.7539 10.108983 6.222246e-19
glance(monash_fit2)
r.squared adj.r.squared sigma statistic p.value df logLik
1 0.4119556 0.397436 292915.5 28.37235 7.217381e-18 5 -2336.56
AIC BIC deviance df.residual
1 4685.121 4703.829 1.389952e+13 162
monash_fit3 <- lm(Price~day+Bedroom2*Bathroom+Car+Landsize, data=monash)
tidy(monash_fit3)
term estimate std.error statistic p.value
1 (Intercept) -337844.7951 696993.9479 -0.484717 6.285400e-01
2 day -353.1323 256.8464 -1.374877 1.710918e-01
3 Bedroom2 223012.1594 210308.3294 1.060406 2.905581e-01
4 Bathroom 531837.4080 400710.8085 1.327235 1.863217e-01
5 Car -92786.0755 28777.1475 -3.224297 1.530910e-03
6 Landsize 1433.5867 143.2217 10.009560 1.283597e-18
7 Bedroom2:Bathroom -130577.5099 120918.2650 -1.079882 2.818197e-01
glance(monash_fit3)
r.squared adj.r.squared sigma statistic p.value df logLik
1 0.4162302 0.3943388 293667.3 19.01344 1.186591e-16 7 -2335.951
AIC BIC deviance df.residual
1 4687.902 4712.846 1.379848e+13 160
This is a selection of models above. The best R2 I get is about 0.41. I am curious about the best values obtained by the students. The deviance is very high for all. Interactions don’t help too much.
Based on the plot, we would expect the properties with one bathroom to have negative trends, and properties with 2 bathrooms have generally increasing prices for 3 bedroom but not for 4 bedrooms. This suggests an interaction term should help.
There should be some interpretation of the final model reported. For example, in the models above it would suggest the negative relationship for some levels of bedroom/bathroom. Surprisingly having a car space decrease the price, perhaps having an interaction term here could be useful to understand if it is contingent on the bedrooms or bathrooms. Generally the higher the land size the higher the price.
It would be good to see in the report that they break down the model, by substituting some values for the explanatory variables and predicting the price. This helps understand the interplay of the different variables.
One point for overall report organisation and readability.